Hephaestus: Data Reuse for Accelerating Scientific Discovery

Authors

  • Jennie Duggan
  • Michael L. Brodie
Abstract

Data-intensive science, wherein domain experts use big data analytics in the course of their research, is becoming increasingly common in the physical and social sciences. Moreover, data reuse is becoming the new normal, owing to the open data movement [15] and the arrival of big science experiments such as the Large Hadron Collider, where a small group of researchers with exotic equipment produces a dataset that is shared by thousands. Unfortunately, weak and spurious correlations are also on the rise in research [5, 27]. For example, Google Flu Trends published its algorithms in 2008 [19] for use in public health, and its accuracy has since plummeted. In the 2011-2012 flu season, the system produced estimates more than 50% higher than the number of cases reported by the U.S. Centers for Disease Control and Prevention [32]. This work first examines common pitfalls associated with data-intensive science and how they contribute to irreproducible results. We then propose a system for conducting virtual experiments over existing data. It simulates randomized controlled trials by reframing the principles of empirical research. These virtual experiments underpin a larger platform we call Hephaestus. This framework accumulates virtual experiments in a visualization to help scientists identify consistencies and anomalies in an area of research. We then highlight a set of research challenges associated with this platform. We argue that by using this approach, data-intensive science may come to achieve accuracy on par with its causality-driven predecessors.


Similar resources

The forgotten role of methenamine to prevent recurrent urinary tract infection: urgency for reuse 100 years after discovery

    In conclusion, UTI is a globally distributed disease with the emergence of multidrug-resistant bacteria; thus, new drug discovery or reactivation of previously used drugs is an urgent issue. Methenamine has been suggested as a beneficial agent for UTI prevention, as it works as a urinary antiseptic by safely producing formaldehyde to prevent bacterial growth while avoiding bacterial resistance...


Investigating Embedded Question Reuse in Question Answering

The investigation presented in this paper is a novel method in question answering (QA) that enables a QA system to gain performance by reusing information in the answer to one question to answer another related question. Our analysis shows that a pair of questions in general open-domain QA can have an embedding relation through their mentions of noun phrase expressions. We present methods f...


Integrating semantic and syntactic descriptions for chaining geographic services

Accelerating the development of complex and heterogeneous geographic services requires improved methods that integrate service discovery, composition, and reuse. We present an integrated use of semantic and syntactic service descriptions for service chaining, by combining an application that supports service discovery and abstract composition, with another that supports concrete composition and...


Managing SPL Variabilities in UAV Simulink Models with pure::variants and Hephaestus

Unmanned Aerial Vehicles (UAV) are vehicles that fly without a pilot and are able to execute different types of missions, such as surveillance, topographical data collection, and environment monitoring. This motivates some degree of variability in the controlling software of UAV – usually specified using Simulink models –, even though it is also possible to reuse software in this domain using s...


The Promise and Potential of Big Data: A Case for Discovery Informatics

The emergence of “big data” offers unprecedented opportunities for not only accelerating scientific advances, but also enabling new modes of discovery. While we understand how to automate routine aspects of data management and analytics, most elements of the scientific process currently require considerable human expertise and effort. We argue that realizing the full potential of data to accele...




Publication date: 2015